To start this project, I have to load in the packages and data that I am going to use. I will explain what I am doing and what the variables mean below.

library(ggplot2)
library(caret)
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Next I read in the data. HR1 and HR2 (Home Runs 1 and 2) load as characters because of the NAs in those columns. The NAs will be ignored for now and I’ll get to them later, but the columns need to be integers. I also count the total NAs we are dealing with and double-check that everything is ready for data analysis. I built the data set by exporting a csv file from FanGraphs after sorting columns and excluding players with fewer than 200 plate appearances, then performed a VLOOKUP for past years’ home run totals in a separate notebook and combined everything into one file. The data prep took roughly an hour but was pretty painless.

library(readr)
HomeRuns <- read_csv('data/HomeRunsPredictorDataRevised.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character(),
##   HR = col_integer(),
##   Age = col_integer(),
##   PA = col_integer(),
##   Doubles = col_integer(),
##   HR1 = col_character(),
##   HR2 = col_character()
## )
## See spec(...) for full column specifications.
HomeRuns$HR1 <- as.integer(HomeRuns$HR1)
## Warning: NAs introduced by coercion
HomeRuns$HR2 <- as.integer(HomeRuns$HR2)
## Warning: NAs introduced by coercion
sum(is.na(HomeRuns$HR1))
## [1] 78
sum(is.na(HomeRuns$HR2))
## [1] 130
str(HomeRuns)
## Classes 'tbl_df', 'tbl' and 'data.frame':    355 obs. of  38 variables:
##  $ Name       : chr  "Max Muncy" "Juan Soto" "Shohei Ohtani" "Ronald Acuna Jr." ...
##  $ HR         : int  35 22 22 26 3 10 7 27 16 24 ...
##  $ Age        : int  27 19 23 20 26 27 24 23 22 21 ...
##  $ PA         : int  481 494 367 487 248 221 334 606 285 484 ...
##  $ Doubles    : int  17 25 21 26 11 14 16 47 9 16 ...
##  $ BBPct      : num  0.164 0.16 0.101 0.092 0.056 0.118 0.147 0.041 0.084 0.087 ...
##  $ KPct       : num  0.272 0.2 0.278 0.253 0.097 0.249 0.138 0.16 0.281 0.252 ...
##  $ BB_K       : num  0.6 0.8 0.36 0.37 0.58 0.47 1.07 0.26 0.3 0.34 ...
##  $ OBP        : num  0.391 0.406 0.361 0.366 0.381 0.357 0.405 0.328 0.34 0.34 ...
##  $ BABIP      : num  0.299 0.338 0.35 0.352 0.359 0.315 0.336 0.316 0.345 0.321 ...
##  $ GB_FB      : num  0.76 1.87 1.32 1.07 0.97 1.23 1.24 1.23 1.65 0.77 ...
##  $ LDPct      : num  0.208 0.175 0.236 0.183 0.216 0.213 0.24 0.202 0.21 0.245 ...
##  $ GBPct      : num  0.343 0.537 0.436 0.423 0.387 0.434 0.421 0.44 0.492 0.328 ...
##  $ FBPct      : num  0.449 0.288 0.329 0.394 0.397 0.353 0.339 0.358 0.298 0.427 ...
##  $ HR_FB      : num  0.294 0.247 0.297 0.211 0.038 0.208 0.089 0.157 0.296 0.179 ...
##  $ wFB        : num  27.1 32.1 18.5 17.3 0.7 7.5 10.4 12.1 6.9 13.1 ...
##  $ wSL        : num  -1.8 -3 0.1 0.3 3.1 0 1.3 8 3.3 -2.7 ...
##  $ wCT        : num  3.1 -0.3 -1.2 1.9 0.9 0.6 -0.3 2.2 -0.5 -1.3 ...
##  $ wCB        : num  -0.3 1 2.7 4.2 2.3 0.9 1.9 5.9 1.5 2.1 ...
##  $ wCH        : num  4.2 0 2.5 7 4.3 0 0.7 1 -0.8 1 ...
##  $ wSF        : num  4.1 -0.9 0.5 -0.2 1 0.7 0.9 -1.4 NA -0.5 ...
##  $ OSwingPct  : num  0.215 0.219 0.323 0.275 0.353 0.234 0.222 0.394 0.317 0.344 ...
##  $ ZSwingPct  : num  0.578 0.607 0.653 0.728 0.842 0.653 0.653 0.74 0.691 0.687 ...
##  $ SwingPct   : num  0.37 0.388 0.457 0.461 0.56 0.409 0.408 0.531 0.464 0.484 ...
##  $ OContactPct: num  0.58 0.681 0.591 0.598 0.749 0.512 0.704 0.697 0.543 0.559 ...
##  $ ZContactPct: num  0.804 0.857 0.806 0.827 0.909 0.832 0.924 0.918 0.81 0.818 ...
##  $ ContactPct : num  0.729 0.801 0.716 0.746 0.851 0.725 0.856 0.819 0.699 0.709 ...
##  $ ZonePct    : num  0.426 0.436 0.406 0.409 0.422 0.418 0.431 0.396 0.393 0.409 ...
##  $ FStrikePct : num  0.559 0.575 0.583 0.62 0.645 0.566 0.542 0.663 0.632 0.622 ...
##  $ SwStrPct   : num  0.1 0.077 0.13 0.117 0.084 0.112 0.059 0.096 0.14 0.141 ...
##  $ PullPct    : num  0.447 0.361 0.369 0.438 0.356 0.416 0.371 0.475 0.337 0.422 ...
##  $ CentPct    : num  0.312 0.364 0.373 0.361 0.351 0.307 0.371 0.292 0.431 0.327 ...
##  $ OppoPct    : num  0.241 0.275 0.258 0.201 0.293 0.277 0.257 0.233 0.232 0.251 ...
##  $ SoftPct    : num  0.124 0.203 0.102 0.137 0.22 0.139 0.118 0.194 0.149 0.14 ...
##  $ MedPct     : num  0.402 0.449 0.467 0.419 0.478 0.423 0.443 0.446 0.409 0.476 ...
##  $ HardPct    : num  0.474 0.348 0.431 0.444 0.302 0.438 0.439 0.36 0.442 0.384 ...
##  $ HR1        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HR2        : int  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 38
##   .. ..$ Name       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HR         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Age        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PA         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Doubles    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BBPct      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ KPct       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ BB_K       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ OBP        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ BABIP      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ GB_FB      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ LDPct      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ GBPct      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ FBPct      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ HR_FB      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ wFB        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ wSL        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ wCT        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ wCB        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ wCH        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ wSF        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ OSwingPct  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ ZSwingPct  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ SwingPct   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ OContactPct: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ ZContactPct: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ ContactPct : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ ZonePct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ FStrikePct : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ SwStrPct   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PullPct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ CentPct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ OppoPct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ SoftPct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ MedPct     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ HardPct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ HR1        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HR2        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
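
The character-to-integer conversion above can be reproduced on a small made-up vector. This is only a sketch: the vector `hr_raw` and its values are hypothetical, and the `"#N/A"` placeholder is an assumption about what the spreadsheet export left in the empty cells, not something confirmed by the data file.

```r
# Hypothetical raw values: "#N/A" stands in for whatever non-numeric
# placeholder the spreadsheet export used (an assumption, not from the data).
hr_raw <- c("33", "#N/A", "12", "#N/A", "0")

# as.integer() turns anything it cannot parse into NA and warns about it;
# suppressWarnings() is used here only to keep the output quiet.
hr_int <- suppressWarnings(as.integer(hr_raw))

hr_int              # the two placeholders are now NA
sum(is.na(hr_int))  # number of missing seasons in this toy vector
```

If the placeholder were known for certain, passing it through the `na` argument of `read_csv()` would probably let the column parse as a number directly and make the manual conversion unnecessary.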

I am attempting to create a regression model that accurately predicts individual players’ home run totals. I have created a spreadsheet using data from FanGraphs.com, and I will examine each variable to see which ones I might be more inclined to use in finding the right model. If I ended up using all of them, I would have a very overfit model, and the best models are often the simplest. I excluded any stats that rely heavily on home run data in their calculation; for example, there is no slugging percentage because home runs are a big part of that equation. I kept things like on-base percentage, though, because a home run counts the same as a single in that calculation, so it is more a measure of hitting ability than of power. I also included the home run to fly ball ratio because it is tracked and can be above or below league average, but I will likely refrain from using it in the model because it uses home runs directly in its calculation. It will still be helpful to see some statistics about it. I will do that in this document and do the regression analysis in the next document. I give a short explanation of each variable below. Here are the initial inputs.

Name: The player’s name in the data set, used here to confirm that all the data matches up
HR: The dependent variable in this study, the number of home runs the player hit
Age: The player’s age
PA: Plate appearances, the number of times a player bats including walks, hit by pitches, sacrifices and so on
Doubles: The number of doubles a player hits
BBPct: The number of walks a player has divided by plate appearances
KPct: The number of strikeouts a player has divided by plate appearances
BB_K: A hitter’s walk to strikeout ratio
OBP: The times a player reaches base via hit, walk or HBP divided by plate appearances
BABIP: A player’s batting average on balls they put into play (excludes strikeouts and home runs)
GB_FB: A hitter’s ground ball to fly ball ratio
LDPct: Line Drive Percentage is the percentage of balls a player hits that are classified as line drives
GBPct: Ground Ball Percentage is the percentage of balls a player hits that are classified as ground balls
FBPct: Fly Ball Percentage is the percentage of balls a player hits that are classified as fly balls
HR_FB: The percentage of fly balls that are home runs for a player
wFB/wSL/wCT/wCB/wCH/wSF: Linear weights of run expectancy for each pitch type (fastball, slider, cutter, curveball, changeup and splitter). If a player hits a double on a 2-0 count off a fastball, their wFB is credited with the linear weight for that outcome; if they strike out on a fastball in their next at-bat, the corresponding linear weight is subtracted from the same stat
OSwingPct: Swings at pitches outside of the strike zone for a player divided by the player’s pitches seen outside the strike zone
ZSwingPct: Swings at pitches inside the strike zone for a player divided by the player’s pitches seen inside the strike zone
SwingPct: Swings divided by pitches seen by a player
OContactPct: Swings on which a player made contact on pitches outside of the strike zone divided by pitches outside the zone that the player swung at
ZContactPct: Swings on which a player made contact on pitches inside of the strike zone divided by pitches inside the zone that the player swung at
ContactPct: Number of pitches on which contact was made divided by swings for a player
ZonePct: Pitches seen inside the strike zone by a player divided by the total pitches they have seen
FStrikePct: Percentage of a player’s plate appearances that start with a first-pitch strike
SwStrPct: Swings and misses by a player divided by total pitches seen
Pull/Cent/OppoPct: The percentage of a player’s batted balls that went to each field. For a lefty the pull field is right field and for a righty it is left field; center is the center field area for both types of hitters, and the opposite field is the opposite of their pull field
Soft/Med/HardPct: Percentage of a player’s batted balls classified by how hard they hit the ball; the three sum to 100%
HR1: Home runs from 2017 (previous season)
HR2: Home runs from 2016 (two seasons prior)
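
To make a few of the ratio definitions above concrete, here is a small worked example. The counts are invented for a hypothetical hitter, and the BABIP line uses the standard formula (H - HR) / (AB - K - HR + SF) rather than anything specific to this data set.

```r
# Hypothetical season line for one hitter (made-up counts).
pa <- 600; ab <- 520; h <- 150; hr <- 25
bb <- 60;  k  <- 120; sf <- 5

bb_pct <- bb / pa   # BBPct: walks per plate appearance
k_pct  <- k / pa    # KPct: strikeouts per plate appearance
bb_k   <- bb / k    # BB_K: walks per strikeout

# BABIP: hits that stayed in the park divided by balls put in play
babip <- (h - hr) / (ab - k - hr + sf)

round(c(BBPct = bb_pct, KPct = k_pct, BB_K = bb_k, BABIP = babip), 3)
```

Because home runs are subtracted from both the numerator and the denominator, BABIP carries almost no direct home run information, which fits the decision to keep it among the candidate predictors.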

summary(HomeRuns)
##      Name                 HR             Age              PA       
##  Length:355         Min.   : 1.00   Min.   :19.00   Min.   :200.0  
##  Class :character   1st Qu.: 7.00   1st Qu.:25.00   1st Qu.:322.0  
##  Mode  :character   Median :13.00   Median :28.00   Median :456.0  
##                     Mean   :14.45   Mean   :28.19   Mean   :449.4  
##                     3rd Qu.:20.00   3rd Qu.:31.00   3rd Qu.:580.5  
##                     Max.   :48.00   Max.   :39.00   Max.   :745.0  
##                                                                    
##     Doubles          BBPct              KPct             BB_K       
##  Min.   : 3.00   Min.   :0.01500   Min.   :0.0730   Min.   :0.1000  
##  1st Qu.:13.00   1st Qu.:0.06400   1st Qu.:0.1760   1st Qu.:0.2900  
##  Median :19.00   Median :0.08400   Median :0.2160   Median :0.4000  
##  Mean   :20.85   Mean   :0.08624   Mean   :0.2174   Mean   :0.4263  
##  3rd Qu.:27.00   3rd Qu.:0.10500   3rd Qu.:0.2535   3rd Qu.:0.5250  
##  Max.   :51.00   Max.   :0.20100   Max.   :0.3850   Max.   :1.3300  
##                                                                     
##       OBP             BABIP            GB_FB           LDPct       
##  Min.   :0.2320   Min.   :0.1890   Min.   :0.510   Min.   :0.1400  
##  1st Qu.:0.2985   1st Qu.:0.2725   1st Qu.:0.970   1st Qu.:0.1915  
##  Median :0.3220   Median :0.2980   Median :1.180   Median :0.2130  
##  Mean   :0.3224   Mean   :0.2972   Mean   :1.269   Mean   :0.2149  
##  3rd Qu.:0.3450   3rd Qu.:0.3200   3rd Qu.:1.495   3rd Qu.:0.2370  
##  Max.   :0.4600   Max.   :0.4060   Max.   :3.900   Max.   :0.3230  
##                                                                    
##      GBPct            FBPct            HR_FB             wFB         
##  Min.   :0.2400   Min.   :0.1620   Min.   :0.0120   Min.   :-16.500  
##  1st Qu.:0.3800   1st Qu.:0.3110   1st Qu.:0.0840   1st Qu.: -3.300  
##  Median :0.4240   Median :0.3620   Median :0.1230   Median :  2.400  
##  Mean   :0.4264   Mean   :0.3587   Mean   :0.1282   Mean   :  3.579  
##  3rd Qu.:0.4745   3rd Qu.:0.4010   3rd Qu.:0.1690   3rd Qu.:  8.800  
##  Max.   :0.6310   Max.   :0.5170   Max.   :0.3500   Max.   : 40.100  
##                                                                      
##       wSL                wCT               wCB               wCH          
##  Min.   :-12.4000   Min.   :-6.4000   Min.   :-7.6000   Min.   :-13.6000  
##  1st Qu.: -3.1000   1st Qu.:-1.4000   1st Qu.:-1.5000   1st Qu.: -1.4000  
##  Median : -1.0000   Median : 0.0000   Median : 0.5000   Median :  0.2000  
##  Mean   : -0.7946   Mean   : 0.0862   Mean   : 0.5485   Mean   :  0.5099  
##  3rd Qu.:  1.8000   3rd Qu.: 1.4000   3rd Qu.: 2.2000   3rd Qu.:  2.1500  
##  Max.   : 12.1000   Max.   : 8.5000   Max.   :11.9000   Max.   : 16.4000  
##                                                                           
##       wSF             OSwingPct        ZSwingPct         SwingPct     
##  Min.   :-3.50000   Min.   :0.1440   Min.   :0.5100   Min.   :0.3430  
##  1st Qu.:-0.60000   1st Qu.:0.2675   1st Qu.:0.6405   1st Qu.:0.4320  
##  Median :-0.10000   Median :0.3100   Median :0.6800   Median :0.4680  
##  Mean   : 0.01048   Mean   :0.3092   Mean   :0.6789   Mean   :0.4675  
##  3rd Qu.: 0.60000   3rd Qu.:0.3510   3rd Qu.:0.7165   3rd Qu.:0.4995  
##  Max.   : 5.00000   Max.   :0.4840   Max.   :0.8510   Max.   :0.6110  
##  NA's   :2                                                            
##   OContactPct      ZContactPct       ContactPct        ZonePct     
##  Min.   :0.4180   Min.   :0.7020   Min.   :0.6100   Min.   :0.374  
##  1st Qu.:0.5705   1st Qu.:0.8275   1st Qu.:0.7320   1st Qu.:0.413  
##  Median :0.6340   Median :0.8630   Median :0.7730   Median :0.429  
##  Mean   :0.6311   Mean   :0.8592   Mean   :0.7736   Mean   :0.428  
##  3rd Qu.:0.6910   3rd Qu.:0.8935   3rd Qu.:0.8145   3rd Qu.:0.442  
##  Max.   :0.8360   Max.   :0.9730   Max.   :0.9110   Max.   :0.483  
##                                                                    
##    FStrikePct        SwStrPct         PullPct          CentPct      
##  Min.   :0.4910   Min.   :0.0360   Min.   :0.2550   Min.   :0.2180  
##  1st Qu.:0.5760   1st Qu.:0.0835   1st Qu.:0.3770   1st Qu.:0.3195  
##  Median :0.6030   Median :0.1050   Median :0.4080   Median :0.3410  
##  Mean   :0.6027   Mean   :0.1066   Mean   :0.4091   Mean   :0.3428  
##  3rd Qu.:0.6260   3rd Qu.:0.1295   3rd Qu.:0.4480   3rd Qu.:0.3640  
##  Max.   :0.6980   Max.   :0.2380   Max.   :0.5870   Max.   :0.4490  
##                                                                     
##     OppoPct          SoftPct           MedPct         HardPct      
##  Min.   :0.1590   Min.   :0.0840   Min.   :0.348   Min.   :0.1910  
##  1st Qu.:0.2170   1st Qu.:0.1535   1st Qu.:0.429   1st Qu.:0.3185  
##  Median :0.2460   Median :0.1780   Median :0.461   Median :0.3620  
##  Mean   :0.2482   Mean   :0.1768   Mean   :0.463   Mean   :0.3604  
##  3rd Qu.:0.2750   3rd Qu.:0.1990   3rd Qu.:0.494   3rd Qu.:0.3995  
##  Max.   :0.3620   Max.   :0.3070   Max.   :0.612   Max.   :0.5090  
##                                                                    
##       HR1             HR2       
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:10.00   1st Qu.: 8.00  
##  Median :17.00   Median :15.00  
##  Mean   :17.84   Mean   :16.71  
##  3rd Qu.:25.00   3rd Qu.:24.00  
##  Max.   :59.00   Max.   :47.00  
##  NA's   :78      NA's   :130

Everything looks lined up well to run correlations and some graphs, and all the information is as expected. We have some NAs for HR1 and HR2, but I will build a separate model for players without these numbers. One issue that I cannot fix within the scope of this project is projecting home runs for a potential rookie. That would require extra analysis of players’ rookie seasons, and I would likely match them up in tiered groups of prospect rankings from either MLB.com or Baseball America. I could also make a model without past homers, or I could run a player’s minor league stats through this model, regress them to the mean and then deduct a percentage for facing a lower level of competition. Minor league to major league equivalent stats actually DO exist! Anyway, let’s just tackle finding out which variables to use in this regression model so we can make our semi-decent home run prediction model.
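
The idea of a separate model for players without HR1 and HR2 could start from a simple split like the one below. This is a sketch on a toy data frame: `toy`, the player names and all numbers are invented, and only the column names mirror `HomeRuns`.

```r
# Toy stand-in for HomeRuns; names and numbers are invented for illustration.
toy <- data.frame(
  Name = c("Player A", "Player B", "Player C", "Player D"),
  HR   = c(30L, 12L, 22L, 8L),
  HR1  = c(28L, NA, 19L, 5L),
  HR2  = c(25L, NA, NA, 7L),
  stringsAsFactors = FALSE
)

# TRUE only for rows where both prior-season totals are present
has_history <- complete.cases(toy[, c("HR1", "HR2")])

with_history    <- toy[has_history, ]   # candidates for a model using HR1/HR2
without_history <- toy[!has_history, ]  # candidates for a model without them

nrow(with_history)
nrow(without_history)
```

Whether a player with HR1 but no HR2 belongs in the first or second group is a modeling choice; here such a player lands in `without_history`.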

corHR1 <- cor(HomeRuns[,2:5])
corrplot.mixed(corHR1)

corHR2 <- cor(HomeRuns[,c(2,6:10)])
corrplot.mixed(corHR2)

corHR3 <- cor(HomeRuns[,c(2,11:15)])
corrplot(corHR3, method = "pie")

corHR4 <- cor(HomeRuns[,c(2,16:20)])
corrplot(corHR4, method = 'square')

corHR4v2 <- cor(HomeRuns[,c(2,16:20)])
corrplot.mixed(corHR4v2)

Wow, I can see why R is a great statistical program at this point. You can’t do this so quickly in Excel without spending a lot of time and effort, especially with no add-ins. I added home runs to each correlation plot because they are the dependent variable and I mostly want to measure everything against them. I know I can use interactions between two variables in a regression model, but I’d like to keep this model simple, and I don’t think I really need them since I already have many useful ratios at my disposal. Age is an issue here because almost all projection systems account for it. As you get older you tend to decline, unless you’re Barry Bonds. I’ll have to figure out a way to incorporate this into the model, since the correlation is -.09, or basically nothing. The big winners so far are plate appearances, doubles, on-base percentage, linear weighted fastball runs, fly balls hit and, to a certain extent, walks. I was hoping for a higher line drive correlation since line drives have the highest batting average of any batted ball type, but it makes sense that more of them would fall in for hits than leave the playing field. I also like the negative effect of ground ball percentage: unless it’s the rare inside-the-park homer, you will not hit a home run on a ground ball. Another disappointment was that FanGraphs didn’t have MLB’s new Statcast data available, because it tracks average fly ball distance and barrels, the top tier of well-hit batted balls, measured as a percentage. If I had to add anything to this model, those would likely be really useful, and adding them is definitely something to think about in the future. For right now I still think we can build this off of FanGraphs data, and the Statcast numbers might even have been redundant with it.
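
On the Age question, one thing worth keeping in mind is that `cor()` only measures straight-line association. The sketch below uses a completely made-up aging curve (a peak at 28 is an assumption for illustration, not something estimated from this data) to show how a real age effect can still produce a correlation near zero.

```r
# Made-up aging curve: production peaks at 28 and falls off on both sides.
age <- 20:36
hr  <- 30 - 0.3 * (age - 28)^2

cor(age, hr)            # essentially zero: there is no straight-line trend
cor((age - 28)^2, hr)   # -1: distance from the peak explains this toy curve
```

If the real data behaved anything like this, a term such as squared distance from a peak age could carry information that the raw Age column hides, though that would need to be checked against the actual players.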

corHR5 <- cor(HomeRuns[,c(2,21:25)])
corrplot.mixed(corHR5)

corHR6 <- cor(HomeRuns[,c(2,26:30)])
corrplot.mixed(corHR6)

corHR7 <- cor(HomeRuns[,c(2,31:36)])
corrplot.mixed(corHR7)

corHR8 <- cor(na.omit(HomeRuns[,c(2,37:38)]))
corrplot.mixed(corHR8)

A lot more “meh” here. There’s a semi-strong negative correlation between ZonePct and HR (about -.47), which is odd but makes sense in a way because good hitters wouldn’t see a lot of strikes overall. Other than that, only pull and hard percentages have a noticeable positive correlation. Medium contact has a stronger negative correlation than soft contact. Home runs from 2017 and 2016 are highly correlated, which I will go over below. Most of these correlations are between -.5 and .5 because there’s a decent amount of data, and truly it’s hard to find one killer variable: while some guys hit the ball to their pull field for homers, other hitters have a different approach and use the opposite field as well. Former Tiger JD Martinez and current Tiger Miguel Cabrera are known for this. JD hit many home runs near me when I used to have season tickets in right field, despite being a right-handed hitter. On the other side there are hitters like Joey Gallo: when he bats, everybody moves to the right side of the field, since he hits left-handed and will almost always pull the ball. Because of this, the regression model is a bit harder to make, and that’s why it’ll be a multiple regression.
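
The full matrix computed next shows NA for wSF, HR1 and HR2 because `cor()` by default propagates missing values. A possible alternative, sketched here on invented numbers, is pairwise deletion via the `use` argument, followed by ranking the predictors by the size of their correlation with HR. The data frame `toy` and its values are hypothetical; only the column names come from the data set.

```r
# Toy data: invented values, with HR1 missing for some players as in HomeRuns.
toy <- data.frame(
  HR      = c(35, 22, 10, 27, 16, 24),
  HardPct = c(0.47, 0.35, 0.30, 0.36, 0.44, 0.38),
  ZonePct = c(0.40, 0.44, 0.45, 0.41, 0.43, 0.42),
  HR1     = c(NA, 20, 12, NA, 15, 22)
)

# The default (use = "everything") returns NA for a pair involving a column with NAs
cor(toy)["HR", "HR1"]

# Pairwise deletion uses, for each pair, only the rows where both values exist
cors <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each predictor with HR, strongest first by absolute value
hr_cors <- cors["HR", colnames(cors) != "HR"]
hr_cors[order(abs(hr_cors), decreasing = TRUE)]
```

With pairwise deletion each correlation can rest on a different set of players, so values for HR1 and HR2 would describe only the hitters who have those seasons on record.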

corHRMatrix <- cor(HomeRuns[,c(2:38)])
corHRMatrix
##                      HR          Age            PA      Doubles
## HR           1.00000000 -0.090299445  0.6792503006  0.584693228
## Age         -0.09029944  1.000000000 -0.0397789691 -0.095761174
## PA           0.67925030 -0.039778969  1.0000000000  0.835024546
## Doubles      0.58469323 -0.095761174  0.8350245455  1.000000000
## BBPct        0.27816333  0.046958324  0.1588074021  0.101973610
## KPct         0.04049126 -0.190044686 -0.2928936593 -0.311309323
## BB_K         0.18908870  0.147434367  0.3056301292  0.287728584
## OBP          0.41247124 -0.018245456  0.4321573710  0.467514825
## BABIP        0.03970784 -0.206690961  0.1846870810  0.287115482
## GB_FB       -0.35196868 -0.069689964 -0.0701439952 -0.160348092
## LDPct       -0.14456899  0.133322517  0.0638513646  0.151910600
## GBPct       -0.32387290 -0.112792398 -0.0919772857 -0.192921052
## FBPct        0.40055732  0.048617464  0.0617214386  0.120855547
## HR_FB        0.73868672 -0.128780172  0.1980686668  0.132992884
## wFB          0.63918558 -0.090182610  0.4778438482  0.513651108
## wSL          0.30023683 -0.090069487  0.1813668111  0.289053427
## wCT          0.28691416 -0.017491911  0.1166053037  0.148138674
## wCB          0.32539861 -0.037472126  0.3095655582  0.402145790
## wCH          0.35911480 -0.045297853  0.2819706030  0.343760113
## wSF                  NA           NA            NA           NA
## OSwingPct    0.02271830 -0.134566267  0.0004340958  0.033283315
## ZSwingPct    0.14228903 -0.129429038  0.0530311210  0.078024478
## SwingPct     0.01455212 -0.148569701 -0.0032899893  0.031040370
## OContactPct -0.08444999  0.136280545  0.2448017773  0.261809784
## ZContactPct -0.15038268  0.158778484  0.1612281754  0.188456345
## ContactPct  -0.15630029  0.175901601  0.2011620072  0.221898877
## ZonePct     -0.46927650  0.045850821 -0.1930785642 -0.196194487
## FStrikePct  -0.21969017 -0.147931240 -0.1340779679 -0.085383820
## SwStrPct     0.13706832 -0.209896180 -0.1726690164 -0.174954281
## PullPct      0.28241832  0.028079302  0.0186140567  0.075579889
## CentPct     -0.09783550 -0.049289224  0.0423617962 -0.007386045
## OppoPct     -0.29573663  0.003471035 -0.0593264592 -0.094430399
## SoftPct     -0.24351909 -0.141208421 -0.1582199830 -0.232756222
## MedPct      -0.45121188 -0.009413684 -0.1210374428 -0.131044339
## HardPct      0.50831560  0.093370445  0.1931767672  0.246863056
## HR1                  NA           NA            NA           NA
## HR2                  NA           NA            NA           NA
##                   BBPct         KPct        BB_K         OBP       BABIP
## HR           0.27816333  0.040491256  0.18908870  0.41247124  0.03970784
## Age          0.04695832 -0.190044686  0.14743437 -0.01824546 -0.20669096
## PA           0.15880740 -0.292893659  0.30563013  0.43215737  0.18468708
## Doubles      0.10197361 -0.311309323  0.28772858  0.46751482  0.28711548
## BBPct        1.00000000  0.116753568  0.70463116  0.62778926  0.01490670
## KPct         0.11675357  1.000000000 -0.54837498 -0.27094953  0.12282115
## BB_K         0.70463116 -0.548374980  1.00000000  0.67240948 -0.05064999
## OBP          0.62778926 -0.270949528  0.67240948  1.00000000  0.57915552
## BABIP        0.01490670  0.122821146 -0.05064999  0.57915552  1.00000000
## GB_FB       -0.13458652 -0.106730783 -0.05835622  0.03680014  0.29796570
## LDPct       -0.04268523 -0.122929201  0.05363512  0.23344362  0.40779958
## GBPct       -0.13876791 -0.092866223 -0.07381997 -0.03574927  0.18987760
## FBPct        0.16134957  0.155393565  0.04763281 -0.07880367 -0.39335577
## HR_FB        0.29540723  0.406246894 -0.03569535  0.31610976  0.13183381
## wFB          0.52460255 -0.053550544  0.46200638  0.76511435  0.40905236
## wSL          0.18108407 -0.148023481  0.25508069  0.44230436  0.27201567
## wCT          0.09226594 -0.008199828  0.06957319  0.23412520  0.09817940
## wCB          0.19450794 -0.209977854  0.31100888  0.45730273  0.26820397
## wCH          0.12243966 -0.187130408  0.23674788  0.38128847  0.20681404
## wSF                  NA           NA          NA          NA          NA
## OSwingPct   -0.75840468  0.006596864 -0.61799719 -0.42863873  0.03138310
## ZSwingPct   -0.34396678  0.095640721 -0.32498490 -0.15731959  0.08146441
## SwingPct    -0.70474973  0.031724742 -0.58036251 -0.39404407  0.05862901
## OContactPct -0.13378822 -0.834076818  0.45789371  0.17932170 -0.10309002
## ZContactPct -0.15045233 -0.822834035  0.41157237  0.16069489 -0.10243559
## ContactPct  -0.04796874 -0.878566886  0.54239225  0.23207269 -0.10947565
## ZonePct     -0.08642299 -0.134980969  0.03426291 -0.13444553 -0.01141790
## FStrikePct  -0.54643944  0.109559316 -0.50351270 -0.32169559  0.18188078
## SwStrPct    -0.21086250  0.760045825 -0.65613645 -0.33057946  0.12645878
## PullPct      0.16324537  0.099654553  0.06701096 -0.04007681 -0.35145423
## CentPct     -0.07876021 -0.022823135 -0.03963191  0.03675304  0.23587764
## OppoPct     -0.15264318 -0.114427697 -0.05647888  0.02255063  0.27252854
## SoftPct     -0.23849242 -0.115980782 -0.13264252 -0.32953666 -0.26101676
## MedPct      -0.18400250 -0.249481993  0.03271656 -0.15278412 -0.02787313
## HardPct      0.29172264  0.269391999  0.05466916  0.32305639  0.18186881
## HR1                  NA           NA          NA          NA          NA
## HR2                  NA           NA          NA          NA          NA
##                     GB_FB         LDPct       GBPct         FBPct
## HR          -0.3519686760 -0.1445689854 -0.32387290  0.4005573202
## Age         -0.0696899642  0.1333225167 -0.11279240  0.0486174645
## PA          -0.0701439952  0.0638513646 -0.09197729  0.0617214386
## Doubles     -0.1603480916  0.1519105995 -0.19292105  0.1208555468
## BBPct       -0.1345865176 -0.0426852301 -0.13876791  0.1613495708
## KPct        -0.1067307831 -0.1229292012 -0.09286622  0.1553935651
## BB_K        -0.0583562197  0.0536351168 -0.07381997  0.0476328129
## OBP          0.0368001360  0.2334436186 -0.03574927 -0.0788036738
## BABIP        0.2979656966  0.4077995779  0.18987760 -0.3933557650
## GB_FB        1.0000000000 -0.0003043942  0.91993912 -0.9337892559
## LDPct       -0.0003043942  1.0000000000 -0.27359435 -0.2151016007
## GBPct        0.9199391187 -0.2735943508  1.00000000 -0.8804543725
## FBPct       -0.9337892559 -0.2151016007 -0.88045437  1.0000000000
## HR_FB       -0.1261463115 -0.1907703963 -0.07235716  0.1683220364
## wFB         -0.1133450212  0.1054214490 -0.15180723  0.1022818759
## wSL         -0.0124680358  0.0996687841 -0.04856734  0.0006646876
## wCT         -0.1298227000 -0.0307123516 -0.12960590  0.1477242612
## wCB         -0.0516241043  0.1296920063 -0.05790534 -0.0055944335
## wCH         -0.1138891794  0.0355924916 -0.12374460  0.1077281791
## wSF                    NA            NA          NA            NA
## OSwingPct    0.0124566464  0.0058259051  0.01879645 -0.0215059395
## ZSwingPct   -0.1227552317  0.0172294462 -0.12282024  0.1168806165
## SwingPct    -0.0194849770  0.0452680924 -0.02405124  0.0026697962
## OContactPct  0.0803912600  0.1951852923  0.03505043 -0.1324669604
## ZContactPct  0.1361508489  0.1772877664  0.08555952 -0.1747404968
## ContactPct   0.1194508621  0.2144467930  0.06240218 -0.1697506856
## ZonePct      0.2173909053  0.2185186206  0.15322492 -0.2637705936
## FStrikePct   0.1525617107  0.1267104716  0.13328799 -0.1978260306
## SwStrPct    -0.1017911659 -0.1636570078 -0.05681011  0.1391649295
## PullPct     -0.5067682592 -0.2226650346 -0.43349559  0.5500515369
## CentPct      0.3679533936  0.0049051843  0.38512585 -0.3936000738
## OppoPct      0.3715057160  0.2917498759  0.26008550 -0.4079633056
## SoftPct      0.1067646046 -0.3343315175  0.19520366 -0.0334067017
## MedPct       0.1869774060  0.1728778312  0.16017490 -0.2479710865
## HardPct     -0.2148279887  0.0669851896 -0.24762033  0.2185057154
## HR1                    NA            NA          NA            NA
## HR2                    NA            NA          NA            NA
##                     HR_FB         wFB           wSL          wCT
## HR           0.7386867242  0.63918558  0.3002368323  0.286914156
## Age         -0.1287801724 -0.09018261 -0.0900694867 -0.017491911
## PA           0.1980686668  0.47784385  0.1813668111  0.116605304
## Doubles      0.1329928843  0.51365111  0.2890534268  0.148138674
## BBPct        0.2954072288  0.52460255  0.1810840709  0.092265936
## KPct         0.4062468941 -0.05355054 -0.1480234809 -0.008199828
## BB_K        -0.0356953517  0.46200638  0.2550806883  0.069573190
## OBP          0.3161097634  0.76511435  0.4423043632  0.234125201
## BABIP        0.1318338142  0.40905236  0.2720156712  0.098179399
## GB_FB       -0.1261463115 -0.11334502 -0.0124680358 -0.129822700
## LDPct       -0.1907703963  0.10542145  0.0996687841 -0.030712352
## GBPct       -0.0723571574 -0.15180723 -0.0485673406 -0.129605903
## FBPct        0.1683220364  0.10228188  0.0006646876  0.147724261
## HR_FB        1.0000000000  0.51116753  0.2639483859  0.234172474
## wFB          0.5111675298  1.00000000  0.2394326503  0.144922210
## wSL          0.2639483859  0.23943265  1.0000000000  0.092205408
## wCT          0.2341724735  0.14492221  0.0922054082  1.000000000
## wCB          0.2167180377  0.29520463  0.3085986516  0.102155709
## wCH          0.1922032451  0.26352985  0.1737058840  0.169048250
## wSF                    NA          NA            NA           NA
## OSwingPct    0.0107050718 -0.25398601 -0.1298770019 -0.043346315
## ZSwingPct    0.1309564234 -0.06494049  0.0471104613  0.086105475
## SwingPct     0.0007563243 -0.25152059 -0.0874160654  0.005088544
## OContactPct -0.4155650298 -0.00634712  0.1125594328 -0.007233188
## ZContactPct -0.4571648503 -0.04960880  0.0661154335 -0.007179248
## ContactPct  -0.4918536865 -0.01157864  0.1109990610  0.002086832
## ZonePct     -0.4731224484 -0.29106451 -0.1262969942 -0.055578529
## FStrikePct  -0.1434723497 -0.28450281 -0.0813124868  0.020558849
## SwStrPct     0.4173163509 -0.07018229 -0.1190874504  0.003222963
## PullPct      0.2193148616  0.08990250  0.0351485606  0.113862998
## CentPct     -0.0333339038 -0.01829294 -0.0242632085 -0.017897043
## OppoPct     -0.2649926826 -0.10571829 -0.0269560651 -0.135982430
## SoftPct     -0.2522301764 -0.34566584 -0.1440752593 -0.098518572
## MedPct      -0.5112483360 -0.25240675 -0.2323547226 -0.106019137
## HardPct      0.5612258398  0.41249309  0.2731754590  0.144834413
## HR1                    NA          NA            NA           NA
## HR2                    NA          NA            NA           NA
##                      wCB          wCH wSF     OSwingPct    ZSwingPct
## HR           0.325398614  0.359114802  NA  0.0227183025  0.142289027
## Age         -0.037472126 -0.045297853  NA -0.1345662674 -0.129429038
## PA           0.309565558  0.281970603  NA  0.0004340958  0.053031121
## Doubles      0.402145790  0.343760113  NA  0.0332833150  0.078024478
## BBPct        0.194507937  0.122439659  NA -0.7584046828 -0.343966784
## KPct        -0.209977854 -0.187130408  NA  0.0065968635  0.095640721
## BB_K         0.311008879  0.236747881  NA -0.6179971902 -0.324984902
## OBP          0.457302734  0.381288466  NA -0.4286387277 -0.157319587
## BABIP        0.268203965  0.206814042  NA  0.0313830952  0.081464412
## GB_FB       -0.051624104 -0.113889179  NA  0.0124566464 -0.122755232
## LDPct        0.129692006  0.035592492  NA  0.0058259051  0.017229446
## GBPct       -0.057905345 -0.123744595  NA  0.0187964453 -0.122820236
## FBPct       -0.005594433  0.107728179  NA -0.0215059395  0.116880617
## HR_FB        0.216718038  0.192203245  NA  0.0107050718  0.130956423
## wFB          0.295204627  0.263529852  NA -0.2539860109 -0.064940488
## wSL          0.308598652  0.173705884  NA -0.1298770019  0.047110461
## wCT          0.102155709  0.169048250  NA -0.0433463150  0.086105475
## wCB          1.000000000  0.159257943  NA -0.1003841480  0.044316237
## wCH          0.159257943  1.000000000  NA -0.0630303334 -0.033961531
## wSF                   NA           NA   1            NA           NA
## OSwingPct   -0.100384148 -0.063030333  NA  1.0000000000  0.570731355
## ZSwingPct    0.044316237 -0.033961531  NA  0.5707313553  1.000000000
## SwingPct    -0.056137909 -0.076251715  NA  0.9133532788  0.835098735
## OContactPct  0.174630545  0.203181155  NA -0.0001205896 -0.227267222
## ZContactPct  0.106715841  0.184121828  NA -0.0662284193 -0.319200707
## ContactPct   0.164669555  0.200341757  NA -0.1936958173 -0.337996931
## ZonePct     -0.056045293 -0.099945805  NA -0.3681043660 -0.337803027
## FStrikePct  -0.067711094 -0.150868015  NA  0.5125849177  0.386497764
## SwStrPct    -0.155630966 -0.200726521  NA  0.4834862181  0.577177991
## PullPct      0.034609794  0.007303391  NA -0.0846409962  0.008440565
## CentPct     -0.041503161  0.007441391  NA  0.0474219273 -0.020530538
## OppoPct     -0.012968316 -0.015431467  NA  0.0728791191  0.004313378
## SoftPct     -0.099836175 -0.151159641  NA  0.1818296773  0.018954245
## MedPct      -0.152511374 -0.118252104  NA  0.0017208815 -0.081453268
## HardPct      0.182304371  0.186229199  NA -0.1122019349  0.054000561
## HR1                   NA           NA  NA            NA           NA
## HR2                   NA           NA  NA            NA           NA
##                  SwingPct   OContactPct  ZContactPct    ContactPct
## HR           0.0145521214 -0.0844499938 -0.150382678 -0.1563002879
## Age         -0.1485697005  0.1362805453  0.158778484  0.1759016013
## PA          -0.0032899893  0.2448017773  0.161228175  0.2011620072
## Doubles      0.0310403702  0.2618097844  0.188456345  0.2218988774
## BBPct       -0.7047497291 -0.1337882239 -0.150452326 -0.0479687407
## KPct         0.0317247418 -0.8340768177 -0.822834035 -0.8785668856
## BB_K        -0.5803625081  0.4578937070  0.411572373  0.5423922549
## OBP         -0.3940440653  0.1793216996  0.160694891  0.2320726873
## BABIP        0.0586290130 -0.1030900221 -0.102435591 -0.1094756461
## GB_FB       -0.0194849770  0.0803912600  0.136150849  0.1194508621
## LDPct        0.0452680924  0.1951852923  0.177287766  0.2144467930
## GBPct       -0.0240512446  0.0350504313  0.085559516  0.0624021796
## FBPct        0.0026697962 -0.1324669604 -0.174740497 -0.1697506856
## HR_FB        0.0007563243 -0.4155650298 -0.457164850 -0.4918536865
## wFB         -0.2515205869 -0.0063471202 -0.049608796 -0.0115786393
## wSL         -0.0874160654  0.1125594328  0.066115434  0.1109990610
## wCT          0.0050885440 -0.0072331879 -0.007179248  0.0020868317
## wCB         -0.0561379091  0.1746305446  0.106715841  0.1646695548
## wCH         -0.0762517149  0.2031811549  0.184121828  0.2003417570
## wSF                    NA            NA           NA            NA
## OSwingPct    0.9133532788 -0.0001205896 -0.066228419 -0.1936958173
## ZSwingPct    0.8350987348 -0.2272672218 -0.319200707 -0.3379969308
## SwingPct     1.0000000000 -0.0809975314 -0.165283894 -0.2458807583
## OContactPct -0.0809975314  1.0000000000  0.722964542  0.9135155999
## ZContactPct -0.1652838942  0.7229645419  1.000000000  0.9089174379
## ContactPct  -0.2458807583  0.9135155999  0.908917438  1.0000000000
## ZonePct     -0.2687880782  0.2072486471  0.250178433  0.3575362118
## FStrikePct   0.5609406142 -0.1106820224 -0.106444210 -0.1652682671
## SwStrPct     0.5589593810 -0.8058474304 -0.827716206 -0.9356948442
## PullPct     -0.0818921086 -0.1278886546 -0.133970200 -0.1358170532
## CentPct      0.0290314912 -0.0117384871  0.021346534 -0.0005122881
## OppoPct      0.0841144115  0.1802365973  0.161830479  0.1820991093
## SoftPct      0.1406332107  0.1054229332  0.082505919  0.0718604265
## MedPct       0.0125934349  0.2838620859  0.302412561  0.3282332809
## HardPct     -0.0953584615 -0.2903799000 -0.291338610 -0.3052859633
## HR1                    NA            NA           NA            NA
## HR2                    NA            NA           NA            NA
##                 ZonePct  FStrikePct     SwStrPct      PullPct
## HR          -0.46927650 -0.21969017  0.137068322  0.282418324
## Age          0.04585082 -0.14793124 -0.209896180  0.028079302
## PA          -0.19307856 -0.13407797 -0.172669016  0.018614057
## Doubles     -0.19619449 -0.08538382 -0.174954281  0.075579889
## BBPct       -0.08642299 -0.54643944 -0.210862498  0.163245370
## KPct        -0.13498097  0.10955932  0.760045825  0.099654553
## BB_K         0.03426291 -0.50351270 -0.656136445  0.067010960
## OBP         -0.13444553 -0.32169559 -0.330579464 -0.040076807
## BABIP       -0.01141790  0.18188078  0.126458775 -0.351454232
## GB_FB        0.21739091  0.15256171 -0.101791166 -0.506768259
## LDPct        0.21851862  0.12671047 -0.163657008 -0.222665035
## GBPct        0.15322492  0.13328799 -0.056810105 -0.433495588
## FBPct       -0.26377059 -0.19782603  0.139164929  0.550051537
## HR_FB       -0.47312245 -0.14347235  0.417316351  0.219314862
## wFB         -0.29106451 -0.28450281 -0.070182287  0.089902497
## wSL         -0.12629699 -0.08131249 -0.119087450  0.035148561
## wCT         -0.05557853  0.02055885  0.003222963  0.113862998
## wCB         -0.05604529 -0.06771109 -0.155630966  0.034609794
## wCH         -0.09994581 -0.15086801 -0.200726521  0.007303391
## wSF                  NA          NA           NA           NA
## OSwingPct   -0.36810437  0.51258492  0.483486218 -0.084640996
## ZSwingPct   -0.33780303  0.38649776  0.577177991  0.008440565
## SwingPct    -0.26878808  0.56094061  0.558959381 -0.081892109
## OContactPct  0.20724865 -0.11068202 -0.805847430 -0.127888655
## ZContactPct  0.25017843 -0.10644421 -0.827716206 -0.133970200
## ContactPct   0.35753621 -0.16526827 -0.935694844 -0.135817053
## ZonePct      1.00000000  0.11716484 -0.385315235 -0.181998686
## FStrikePct   0.11716484  1.00000000  0.340808048 -0.159518397
## SwStrPct    -0.38531523  0.34080805  1.000000000  0.075705980
## PullPct     -0.18199869 -0.15951840  0.075705980  1.000000000
## CentPct      0.04532456  0.07139487  0.022704421 -0.661193613
## OppoPct      0.20628607  0.15306128 -0.120603901 -0.788003499
## SoftPct      0.04696737  0.12579692 -0.009844386 -0.026838058
## MedPct       0.34335602  0.09170947 -0.265091060 -0.230908482
## HardPct     -0.30177625 -0.14967654  0.217320532  0.200570593
## HR1                  NA          NA           NA           NA
## HR2                  NA          NA           NA           NA
##                   CentPct      OppoPct      SoftPct       MedPct
## HR          -0.0978354984 -0.295736629 -0.243519089 -0.451211879
## Age         -0.0492892245  0.003471035 -0.141208421 -0.009413684
## PA           0.0423617962 -0.059326459 -0.158219983 -0.121037443
## Doubles     -0.0073860446 -0.094430399 -0.232756222 -0.131044339
## BBPct       -0.0787602115 -0.152643178 -0.238492417 -0.184002498
## KPct        -0.0228231348 -0.114427697 -0.115980782 -0.249481993
## BB_K        -0.0396319090 -0.056478884 -0.132642521  0.032716560
## OBP          0.0367530351  0.022550630 -0.329536658 -0.152784115
## BABIP        0.2358776352  0.272528536 -0.261016764 -0.027873125
## GB_FB        0.3679533936  0.371505716  0.106764605  0.186977406
## LDPct        0.0049051843  0.291749876 -0.334331518  0.172877831
## GBPct        0.3851258508  0.260085504  0.195203663  0.160174902
## FBPct       -0.3936000738 -0.407963306 -0.033406702 -0.247971087
## HR_FB       -0.0333339038 -0.264992683 -0.252230176 -0.511248336
## wFB         -0.0182929407 -0.105718290 -0.345665844 -0.252406747
## wSL         -0.0242632085 -0.026956065 -0.144075259 -0.232354723
## wCT         -0.0178970435 -0.135982430 -0.098518572 -0.106019137
## wCB         -0.0415031613 -0.012968316 -0.099836175 -0.152511374
## wCH          0.0074413906 -0.015431467 -0.151159641 -0.118252104
## wSF                    NA           NA           NA           NA
## OSwingPct    0.0474219273  0.072879119  0.181829677  0.001720882
## ZSwingPct   -0.0205305378  0.004313378  0.018954245 -0.081453268
## SwingPct     0.0290314912  0.084114412  0.140633211  0.012593435
## OContactPct -0.0117384871  0.180236597  0.105422933  0.283862086
## ZContactPct  0.0213465338  0.161830479  0.082505919  0.302412561
## ContactPct  -0.0005122881  0.182099109  0.071860427  0.328233281
## ZonePct      0.0453245614  0.206286065  0.046967373  0.343356019
## FStrikePct   0.0713948690  0.153061281  0.125796924  0.091709467
## SwStrPct     0.0227044207 -0.120603901 -0.009844386 -0.265091060
## PullPct     -0.6611936132 -0.788003499 -0.026838058 -0.230908482
## CentPct      1.0000000000  0.059222688 -0.008133611  0.036553617
## OppoPct      0.0592226880  1.000000000  0.042802076  0.278599212
## SoftPct     -0.0081336113  0.042802076  1.000000000 -0.008202781
## MedPct       0.0365536171  0.278599212 -0.008202781  1.000000000
## HardPct     -0.0241122719 -0.248419212 -0.604347852 -0.791692804
## HR1                    NA           NA           NA           NA
## HR2                    NA           NA           NA           NA
##                 HardPct HR1 HR2
## HR           0.50831560  NA  NA
## Age          0.09337045  NA  NA
## PA           0.19317677  NA  NA
## Doubles      0.24686306  NA  NA
## BBPct        0.29172264  NA  NA
## KPct         0.26939200  NA  NA
## BB_K         0.05466916  NA  NA
## OBP          0.32305639  NA  NA
## BABIP        0.18186881  NA  NA
## GB_FB       -0.21482799  NA  NA
## LDPct        0.06698519  NA  NA
## GBPct       -0.24762033  NA  NA
## FBPct        0.21850572  NA  NA
## HR_FB        0.56122584  NA  NA
## wFB          0.41249309  NA  NA
## wSL          0.27317546  NA  NA
## wCT          0.14483441  NA  NA
## wCB          0.18230437  NA  NA
## wCH          0.18622920  NA  NA
## wSF                  NA  NA  NA
## OSwingPct   -0.11220193  NA  NA
## ZSwingPct    0.05400056  NA  NA
## SwingPct    -0.09535846  NA  NA
## OContactPct -0.29037990  NA  NA
## ZContactPct -0.29133861  NA  NA
## ContactPct  -0.30528596  NA  NA
## ZonePct     -0.30177625  NA  NA
## FStrikePct  -0.14967654  NA  NA
## SwStrPct     0.21732053  NA  NA
## PullPct      0.20057059  NA  NA
## CentPct     -0.02411227  NA  NA
## OppoPct     -0.24841921  NA  NA
## SoftPct     -0.60434785  NA  NA
## MedPct      -0.79169280  NA  NA
## HardPct      1.00000000  NA  NA
## HR1                  NA   1  NA
## HR2                  NA  NA   1

Here’s a correlation matrix, which is just one big chart of everything I just did. It’s more for reference than for going over results, since I already highlighted the big takeaways.
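The rows and columns of NA for wSF, HR1 and HR2 in the matrix come from the missing values in those columns, because `cor()` returns NA for any pair that contains an NA by default. A minimal sketch of one way around that, using a made-up toy data frame rather than the real HomeRuns data:

```r
# Toy data standing in for a few HomeRuns columns (values are made up).
toy <- data.frame(
  HR  = c(35, 22, 18, 40, 12, 27),
  PA  = c(650, 540, 480, 690, 350, 600),
  HR1 = c(30, NA, 15, 38, NA, 25)
)

# Default behaviour: any NA in a column makes its correlations NA.
cor_default <- cor(toy)

# Pairwise deletion: each pair of columns uses only the rows
# where both values are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

round(cor_pairwise, 3)
```

The trade-off is that each cell can then be based on a different set of players, so the cells are not strictly comparable with each other.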

ggplot(HomeRuns,aes(x=HR,y=HR1)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 78 rows containing non-finite values (stat_smooth).
## Warning: Removed 78 rows containing missing values (geom_point).

ggplot(HomeRuns,aes(x=HR,y=HR2)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 130 rows containing non-finite values (stat_smooth).
## Warning: Removed 130 rows containing missing values (geom_point).

Tom Tango, a famed baseball statistician, came up with a projection system called Marcel, which I’m pretty sure is named after the monkey from “Friends.” People bugged him to make a projection system since he does so much good statistical baseball research, but Tango wasn’t really interested, so he quickly made a very simple system that takes a player’s last three home run totals, an age factor, and regression to the league mean. I implemented the first two in this model, but since this is a regression I will add other variables rather than regressing to the league mean; that’s something I could add to this model one day. As you can see here, though, a player’s last two seasons of home run totals are highly correlated with future totals.
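To make the three ingredients concrete, here is a rough sketch of a Marcel-style calculation on raw home run totals. This is not Tango’s published procedure: the 5/4/3 season weights are the ones usually quoted for Marcel, while `league_hr`, `regress_weight` and `age_step` are placeholder assumptions I picked only for illustration.

```r
# Marcel-style home run projection (simplified sketch, NOT Tango's exact method).
# league_hr, regress_weight, and age_step are illustrative assumptions.
marcel_hr <- function(hr, age, league_hr = 18, weights = c(5, 4, 3),
                      regress_weight = 2, peak_age = 29, age_step = 0.005) {
  # hr: home run totals for the last three seasons, most recent first
  ok <- !is.na(hr)

  # Weighted average of the available seasons, pulled toward the league mean
  blended <- (sum(weights[ok] * hr[ok]) + regress_weight * league_hr) /
    (sum(weights[ok]) + regress_weight)

  # Age factor: small boost under the peak age, small penalty over it
  blended * (1 + (peak_age - age) * age_step)
}

marcel_hr(c(30, 24, 18), age = 25)  # younger than the peak age
marcel_hr(c(30, 24, 18), age = 33)  # older than the peak age
```

A player with missing earlier seasons simply gets fewer weighted terms, so the league average counts for relatively more in his projection.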

ggplot(HomeRuns,aes(x=HR, fill=Age)) + geom_histogram(binwidth = 2) + facet_wrap(~Age)

Okay, here I am just trying to see if there are any patterns in home runs by age in the faceted histogram, but this is just a bunch of randomness.

ggplot(HomeRuns) + geom_boxplot(aes(x = Age, y = HR)) + facet_wrap(~Age)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

We can see that the median typically grows or stays the same from ages 24-30, then slowly starts to go down.
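Boxplots draw the median, so a quick table of the mean, median and player count per age is a handy cross-check on what the eye picks up. A small sketch on made-up numbers (the real version would use `HomeRuns$Age` and `HomeRuns$HR`):

```r
# Made-up ages and home run totals, only to show the mechanics.
toy <- data.frame(
  Age = c(24, 24, 27, 27, 27, 31, 31, 34),
  HR  = c(10, 14, 22, 30, 20, 18, 12, 8)
)

# tapply() returns one value per age, ordered by age.
by_age <- data.frame(
  Age    = sort(unique(toy$Age)),
  mean   = as.vector(tapply(toy$HR, toy$Age, mean)),
  median = as.vector(tapply(toy$HR, toy$Age, median)),
  n      = as.vector(tapply(toy$HR, toy$Age, length))
)

by_age
```

The `n` column matters because ages with only a handful of players can swing a lot. In the ggplot2 call itself, mapping `x = factor(Age)` is the usual way to avoid the continuous-x warning.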

HomeRuns2 <- HomeRuns$HR >= 20
HomeRuns3 <- HomeRuns[HomeRuns2,]

Let’s try taking just the guys who hit at least 20 homers and see if power hitters show any more patterns.

ggplot(HomeRuns3,aes(x=HR, fill=Age)) + geom_histogram(binwidth = 2) + facet_wrap(~Age)

Yeah there’s not much more here as far as patterns go.

HomeRuns$Age2 <- 29 - HomeRuns$Age
HomeRuns$HRDiff <- HomeRuns$HR - HomeRuns$HR1

ggplot(HomeRuns,aes(x=HRDiff,y=Age2)) + geom_point() +geom_smooth(method = 'lm')
## Warning: Removed 78 rows containing non-finite values (stat_smooth).
## Warning: Removed 78 rows containing missing values (geom_point).

So now I’m designating 29 as my prime age. Typically guys have their best seasons between 28 and 32; Miguel Cabrera won his MVPs in his age 29 and 30 seasons. Tom Tango also uses 29 as the peak age in his projection system. If a guy is under 29, he gets credit for getting better when projecting next season’s home run total, and if he’s over 29, then he is expected to start aging and get worse. I also created an HR difference column (HRDiff) to measure the change in a player’s home runs from 2017 to 2018, so if he hit fewer homers in 2018 than in 2017 he will have a negative value.
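The sign conventions are easy to mix up, so here is a tiny sketch of the two derived columns on made-up players: Age2 is positive for players younger than 29, and HRDiff is negative when a player hit fewer homers in 2018 than in 2017.

```r
# Made-up players; HR is the 2018 total and HR1 the 2017 total.
toy <- data.frame(
  Name = c("Young hitter", "Prime hitter", "Older hitter", "Rookie"),
  Age  = c(24, 29, 33, 22),
  HR   = c(25, 30, 18, 15),
  HR1  = c(20, 30, 24, NA)   # NA = no 2017 season on record
)

toy$Age2   <- 29 - toy$Age       # years until (+) or past (-) age 29
toy$HRDiff <- toy$HR - toy$HR1   # change from 2017 to 2018

toy
```

The rookie’s HRDiff stays NA, which is why the plots drop those rows with a warning.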

OldAge <- HomeRuns$Age >= 29
HomeRuns4 <- HomeRuns[OldAge,]

ggplot(HomeRuns4,aes(x=HRDiff, fill=Age)) + geom_histogram(binwidth = 2)
## Warning: Removed 9 rows containing non-finite values (stat_bin).

ggplot(HomeRuns4,aes(x=HR1,y=HR)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).

ggplot(HomeRuns4) + geom_boxplot(aes(x = Age, y = HR)) + facet_wrap(~Age)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

That histogram looks like roughly what I’m looking for! There are more guys on the negative side of the graph. Looks like being 30 really does mean you’re getting old. Now we just have to check the mean and median for the young players and hope that their drop is smaller, or even a gain. Given the variance, I’m looking for a number a few homers left of 0 for this group and around 0 for the young guys. We can also see the average typically going down with age in the boxplot.

mean(na.omit(HomeRuns4$HRDiff))
## [1] -4
median(na.omit(HomeRuns4$HRDiff))
## [1] -4

-4 is both the mean and median, which is good because it means the average player aged 29 or older lost 4 homers from 2017 to 2018. That supports my theory.

YoungAge <- HomeRuns$Age <= 29
HomeRuns5 <- HomeRuns[YoungAge,]

ggplot(HomeRuns5,aes(x=HRDiff, fill=Age)) + geom_histogram(binwidth = 2)
## Warning: Removed 72 rows containing non-finite values (stat_bin).

ggplot(HomeRuns5,aes(x=HR1,y=HR)) + geom_point() +geom_smooth(method = 'lm')
## Warning: Removed 72 rows containing non-finite values (stat_smooth).
## Warning: Removed 72 rows containing missing values (geom_point).

ggplot(HomeRuns5) + geom_boxplot(aes(x = Age, y = HR)) + facet_wrap(~Age)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

mean(na.omit(HomeRuns5$HRDiff))
## [1] -0.9817073
median(na.omit(HomeRuns5$HRDiff))
## [1] -1

Above are the young players. The mean and median are roughly -1, which is pretty much 0, especially considering guys who may skew the data due to injuries. This Age2 variable will penalize older players in my regression model, which is basically what I am looking for it to do. It’s not perfect, because I’d rather have the mean and median for this group be positive, but it will work for my model. The average typically goes up with age on these boxplots too, except for the pesky 26 and 27 year olds. Seems like Tango was right on his cutoff point, though.
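One detail of the two filters above: `>= 29` and `<= 29` both include players who are exactly 29, so those players are counted in both groups. A sketch of a single-pass split where every player lands in exactly one group, on made-up numbers (which side gets the 29-year-olds is my own choice here):

```r
# Made-up ages and year-over-year home run changes.
toy <- data.frame(
  Age    = c(24, 26, 29, 29, 31, 34),
  HRDiff = c(3, -1, 2, NA, -5, -7)
)

# Each player gets exactly one label; 29-year-olds go with the older group.
group <- ifelse(toy$Age >= 29, "29 and over", "under 29")

# na.rm = TRUE drops players with no previous season.
mean_by_group   <- tapply(toy$HRDiff, group, mean, na.rm = TRUE)
median_by_group <- tapply(toy$HRDiff, group, median, na.rm = TRUE)

mean_by_group
median_by_group
```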

ggplot(HomeRuns,aes(x=HR,y=PA)) + geom_point() + geom_smooth(method = 'lm')

Now I simply made a scatter plot of each of the other variables to see the relationships graphically; I know this is what my Forecasting professor, Professor Roumani, would want me to do. We can see plate appearances line up really well with home runs, almost as well as the previous season’s home run total. This makes sense because you need to get up to bat in order to hit home runs; you can’t do that from the bench. There’s a trend here: more PAs, more longballs.
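As a companion to eyeballing one scatter plot at a time, the predictors can be ranked by the size of their correlation with HR in a few lines. This is only a sketch on a made-up data frame with three of the columns; the values are invented.

```r
# Made-up values for a few of the columns plotted in this section.
toy <- data.frame(
  HR      = c(35, 22, 18, 40, 12, 27),
  PA      = c(650, 540, 480, 690, 350, 600),
  Doubles = c(30, 28, 20, 35, 12, 33),
  GB_FB   = c(0.8, 1.2, 1.6, 0.7, 1.9, 1.0)
)

# Correlation of every non-HR column with HR.
predictors <- setdiff(names(toy), "HR")
hr_cor <- sapply(toy[predictors],
                 function(x) cor(toy$HR, x, use = "complete.obs"))

# Strongest relationships first, regardless of sign.
hr_cor <- hr_cor[order(abs(hr_cor), decreasing = TRUE)]
round(hr_cor, 2)
```

Sorting by absolute value keeps strong negative relationships, like ground balls to fly balls, near the top instead of burying them at the bottom.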

ggplot(HomeRuns,aes(x=HR,y=wFB)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=wCB)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=wSL)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=wCH)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=wCT)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=wSF)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

Well, it turns out being able to hit a fastball well really helps you hit home runs in MLB. It is the fastest and most common pitch, so being able to hit it well leads to homers. Every other pitch has a slight positive slope, which makes sense because if you hit a changeup really well, you will likely hit more home runs. Overall, though, the other pitches are not very good indicators of home runs: since I’d expect a positive slope anyway, I’d need to see a stronger one to put them in my model. I might still make a model based just on how well you hit each pitch so we can see which pitches are actually significant.
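For what a pitch-only model might look like, here is a sketch with `lm()` on simulated data. The column names match the pitch-value columns plotted above, but every number is generated, and the relationship I bake in (fastball value mattering most) is an assumption made only so the example has something to find.

```r
set.seed(42)
n <- 200

# Simulated pitch values; the real columns would come from HomeRuns.
sim <- data.frame(
  wFB = rnorm(n, mean = 0, sd = 8),
  wSL = rnorm(n, mean = 0, sd = 4),
  wCB = rnorm(n, mean = 0, sd = 3),
  wCH = rnorm(n, mean = 0, sd = 3),
  wCT = rnorm(n, mean = 0, sd = 2)
)

# Made-up relationship: fastball value drives HR, the rest barely matter.
sim$HR <- 18 + 0.6 * sim$wFB + 0.1 * sim$wSL + rnorm(n, mean = 0, sd = 5)

# Regress home runs on the pitch values only.
pitch_fit  <- lm(HR ~ wFB + wSL + wCB + wCH + wCT, data = sim)
coef_table <- summary(pitch_fit)$coefficients

round(coef_table, 3)
```

The `Pr(>|t|)` column of the coefficient table is where I would look to see which pitches are actually significant.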

ggplot(HomeRuns,aes(x=HR,y=Doubles)) + geom_point() + geom_smooth(method = 'lm')

My dad used to help me study for and draft my fantasy teams; we still do a league together with some Texas Rangers beat writers (his best friend moved to Texas). One of the first bits of advice he gave me for drafting in the late rounds was that if I was looking for power hitters who were going to break out, I should look at doubles, because a guy who hits a lot of doubles likely just missed turning those hits into home runs, since doubles are usually hit deep. It’s nowhere near perfect, especially with the Statcast numbers now where you can know how hard and far every ball is hit, but sure enough there’s some pretty high correlation here.

ggplot(HomeRuns,aes(x=HR,y=BB_K)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=BBPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=KPct)) + geom_point() + geom_smooth(method = 'lm')

Strikeouts and walks are two of the three true outcomes in baseball, along with home runs. These are the only outcomes that don’t involve the fielders, so they come down to the hitter and the pitcher. In today’s game a lot of players strike out often, which usually goes along with lower averages but more home runs. We can see that walks have a bit of a correlation but strikeouts don’t have much of one at all. The strikeout trend line does seem to slope slightly upward, though, so interestingly enough strikeout-prone hitters tend to be better power hitters on average. A good walk-to-strikeout ratio is seen as a sign of a good eye, but we can see this barely translates into hitting for power.

ggplot(HomeRuns,aes(x=HR,y=OBP)) + geom_point() + geom_smooth(method = 'lm')

Home runs count toward OBP as times on base, so there is some multicollinearity here, but not too much, since a homer weighs the same as any other time on base. Being a good hitter does have somewhat of a relationship with hitting homers.
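The standard OBP formula makes the point about weighting clear: hits go in as a single total, so a home run adds exactly as much as a single. A small sketch, where the function name and the season line are made up:

```r
# On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)
obp <- function(h, bb, hbp, ab, sf) {
  (h + bb + hbp) / (ab + bb + hbp + sf)
}

# Made-up season line.
base_obp <- obp(h = 150, bb = 60, hbp = 5, ab = 500, sf = 5)

# One more at-bat that ends in a hit. The formula never asks whether
# that hit was a single or a home run.
one_more_hit <- obp(h = 151, bb = 60, hbp = 5, ab = 501, sf = 5)

round(c(base = base_obp, one_more_hit = one_more_hit), 4)
```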

ggplot(HomeRuns,aes(x=HR,y=BABIP)) + geom_point() + geom_smooth(method = 'lm')

There’s really nothing to see here, despite the fact that you’d think having a good average on balls in play would lead to having more homers.

ggplot(HomeRuns,aes(x=HR,y=OSwingPct)) + geom_point() + geom_smooth(method = 'lm')

These are pitches outside of the strike zone that are swung at. Free swinging has no real relationship with home runs, so it doesn’t seem to hurt hitters to swing often, even if they swing at a lot of bad pitches.

ggplot(HomeRuns,aes(x=HR,y=ZSwingPct)) + geom_point() + geom_smooth(method = 'lm')

Again, swinging often helps you, and it helps slightly more if you swing at pitches in the zone.

ggplot(HomeRuns,aes(x=HR,y=SwingPct)) + geom_point() + geom_smooth(method = 'lm')

This is a case of two groups cancelling each other out: there are really selective power hitters like Joey Votto and guys who swing at everything like Khris Davis, who both have power, so how much you swing really has no effect on home runs.

ggplot(HomeRuns,aes(x=HR,y=OContactPct)) + geom_point(color = "red") + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=ZContactPct)) + geom_point(color= "dark red") + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=ContactPct)) + geom_point() + geom_smooth(method = 'lm')

These graphs all show that being a good contact hitter might actually hurt you if you’re trying exclusively to hit home runs, because making contact could just mean weak contact for outs. In today’s game, it seems there’s actually a beauty to swinging hard just in case you hit the ball. The type of contact you make is more important than making it or where you make it.

ggplot(HomeRuns,aes(x=HR,y=ZonePct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=FStrikePct)) + geom_point() + geom_smooth(method = 'lm')

This suggests that good power hitters really do see fewer first-pitch strikes and fewer strikes altogether. That’s something interesting to keep in mind when I make my model, because lower percentages in these stats could be a sign of a power hitter.

ggplot(HomeRuns,aes(x=HR,y=SwStrPct)) + geom_point() + geom_smooth(method = 'lm')

Another graphical example of how swinging leads to more homers, even if you miss the ball often.

ggplot(HomeRuns,aes(x=HR,y=PullPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=CentPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=OppoPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=SoftPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=MedPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=HardPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=GB_FB)) + geom_point() + geom_smooth(method = 'lm')

More fly balls compared to grounders = a better chance at a home run. Just look at the trend; a negative trend is actually good for home runs in this case.

ggplot(HomeRuns,aes(x=HR,y=GB_FB)) + geom_point(color = ifelse(abs(HomeRuns$GB_FB) > 1.5, "pink", "black")) + geom_smooth(method = 'lm')

Just wanted to make this graph more appealing to the eye and show the importance of keeping the ball in the air even more.

ggplot(HomeRuns,aes(x=HR,y=LDPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=GBPct)) + geom_point() + geom_smooth(method = 'lm')

ggplot(HomeRuns,aes(x=HR,y=FBPct)) + geom_point() + geom_smooth(method = 'lm')

Groundballs definitely do not equate to home runs, and I think GBPct will be the key negative variable in any model where I allow negative coefficients. I also have a feeling that FBPct will end up in my best model, since people who hit a lot of fly balls hit more homers, even though the trend line looks skewed upwards a bit here. I’m actually surprised line drives have that much of a negative effect, because they usually indicate a very good hitter, but it looks like I’m going nowhere with that.

ggplot(HomeRuns,aes(x=HR,y=HR_FB)) + geom_point() + geom_smooth(method = 'lm')

Probably not going to use this in my model since it directly involves home runs. There’s a league average for this number, though, and chances are that if a player is above the trend line his homers will decline the next season. Then again, he might consistently outperform this average due to an uppercut swing, playing in a park that favors hitters, and so on.
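The regression idea can be sketched by comparing each player’s HR/FB rate with the group rate and asking how many homers his fly balls would have produced at that rate. Everything here is made up, including the `FlyBalls` count column, which is a hypothetical stand-in since the dataset only carries the rate; the "league" rate is just the rate across these four toy players.

```r
# Made-up players; FlyBalls is a hypothetical count column.
toy <- data.frame(
  Name     = c("A", "B", "C", "D"),
  HR       = c(40, 20, 25, 10),
  FlyBalls = c(160, 150, 200, 120)
)

# Each player's own home-run-per-fly-ball rate.
toy$HR_FB <- toy$HR / toy$FlyBalls

# Group-wide rate, standing in for the league average.
league_rate <- sum(toy$HR) / sum(toy$FlyBalls)

# Homers each player would have hit at the group rate.
toy$ExpectedHR <- toy$FlyBalls * league_rate
toy$AboveLeague <- toy$HR_FB > league_rate

toy
```

Players flagged TRUE are the candidates for a decline, unless something like their swing or home park keeps them above average year after year.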

Overall, I have a pretty good feel for my variables and what I’m working with now, and definitely a better idea of what models I want to build. Even if I don’t get a really great model, I think I can learn some important things about these numbers and how they can project home runs. The goal, though, is to build a great model to use for fantasy baseball predictions (and then to get an A on this project and get signed by an MLB team to do statistical analysis).

save(HomeRuns, file = "HomeRunsNewdf.rdata")